This paper implements a data aware early prediction of hypertension-based
diseases. The data aware element is an automated data preprocessing method,
included in the disease classification algorithm, that adapts to both
balanced and unbalanced data. The proposed preprocessing method is
evaluated with an ensemble learning based classification algorithm for early
disease prediction. As part of the automation, it adopts the isolation forest
algorithm for outlier detection and applies the sampling method that
corresponds to either balanced or unbalanced data. The performance of the
proposed data aware algorithm with isolation forest based anomaly detection
is evaluated experimentally. In the Python based implementation, the
automated data preprocessing with isolation forest yielded a better area
under the receiver operating characteristic curve (ROC-AUC). Among the
individual classifiers, the multilayer perceptron classifier reached an AUC
of 0.918. The ensemble learning algorithm, which includes a multilayer
perceptron classifier, a logistic regression classifier, a support vector
classifier and a decision tree algorithm together with the isolation forest
based anomaly detection algorithm, performed better than the individual
machine learning algorithms, with an AUC of 0.922.

1. INTRODUCTION


Diabetes is a disease that leads to multiple other serious conditions, including cardiovascular diseases,
thyroid problems, bone related issues and heart attack. Diabetes and blood pressure are related symptoms of
numerous cardiovascular diseases. Thus, early detection of diabetes helps patients avoid serious and
potentially fatal complications. According to a survey conducted by the International Diabetes Federation
(IDF), in 2020, 77 million of the 88 million people affected by diabetes in Southeast Asia are from India [1].
Overall, 463 million people have diabetes in the world [1]. Type 1 diabetes is prevalent in children, with
India ranking second after the USA and highest in the Southeast region [2]. The World Health Organization
(WHO) records that 2% of total deaths in India are due to diabetes [3]. Diabetes and hypertension are found
to be common non-communicable diseases [4], [5]. Diabetes accounts for 46.2% and hypertension for 4% of
total deaths [6]. Type 2 diabetes is a metabolic disorder in which the production of insulin is disturbed in
relation to blood glucose levels [7], [8]. The risk of stroke and mortality increases as diabetes becomes
inherent in more of the population in India [9]. By 2030, the number of people affected is expected to reach
228 million in developing countries [10], [11]. An estimated 1 billion adults in developing countries will
suffer from hypertension by 2025 [12], and one in every three people would die due to hypertension [13].

Machine learning algorithms are applied to the prediction of hypertension and diabetes [14], [15].
Smart sensors and cloud computing based continuous monitoring of vital health signs and risk factors, such
as electrocardiogram (ECG), premature atrial contraction (PAC), alcohol consumption, smoking habits and
caffeine intake, help early detection of cardiovascular diseases [16]. Heart rate variability (HRV) at
different times of the day is observed to predict hypertensive patients at higher risk. A decision tree
algorithm is applied with random under-sampling boosting (RUSBoost) to train on the HRV and demographic
features and predict high risk patients [17]. Type 2 diabetes and hypertension prediction using an ensemble
learning algorithm is developed as a mobile application on four different datasets. Risk factor data is
observed and communicated to a remote server [18]. Occupation related factors are included in the risk
assessment of hypertension among steel industry employees in China. The risk of hypertension is evaluated
using a learning vector quantization (LVQ) neural network algorithm and a Fisher-support vector machine
(SVM) algorithm. The variation of accuracy with the input data sampling is evaluated to observe the
‘tailing’ process in both machine learning algorithms [19]. A tree-based approach applies different machine
learning algorithms to different subsets of the feature space. Each subset of the feature set is associated
with a node in the decision tree developed [20].

Traditional learning methods, which build the training model from the filtered labels and the noisy
inferred labels, are straightforward. Unlike the traditional method, the two-stage approach involves first
filtering true labels and then building a training algorithm. Inferring the filtered labels reduces the
accuracy of prediction since useful information may be lost. The bootstrapping method creates subsets from
the total dataset, which are assigned the class memberships of the multiple noisy labels. A base classifier
is trained on each extended sub-dataset. Unlabeled instances are predicted by aggregating the outputs of
the M base classifiers. This advanced ensemble learning algorithm is found to be effective compared to the
traditional methods [21]. Multi-tier weighted ensemble learning (MTWEL) is developed, which optimizes the
parameters of all the learning algorithms in the ensemble using a genetic algorithm (GA). Heart disease is
predicted using the algorithm and good performance is observed [22]. Feature importance selection is
applied to the features selected for prediction. A combination of k-nearest neighbor (KNN) and logistic
regression is used in an ensemble learning framework that predicts cardiovascular diseases. Data imbalance
is managed via the synthetic minority over-sampling technique (SMOTE) [23]. Independent component analysis
(ICA) is used for dimensionality reduction to implement a lung cancer detection algorithm using an AdaBoost
based ensemble learning method [24].

A meta-learning method combines multiple learning algorithms for time series prediction using a
combination policy over the ensemble. An actor-critic model is developed to optimize the weights in the
deep reinforcement learning based ensemble learning framework [25]. A Gaussian mixture model (GMM) based
data preprocessing method is used for a power grid environment algorithm [26]. Data preprocessing to handle
imbalanced data in an industrial scenario using time series prediction is dealt with in [27]. The streamed
data are immediately passed to a Z-score normalization method to homogenize the data range. A sliding
window method is applied to sort the time series data before applying the machine learning algorithm. A
hybrid ensemble learning algorithm is introduced for wind forecasting. Back propagation, least square
support vector machine (LSSVM), adaptive neuro-fuzzy inference system (ANFIS) and Elman neural network
(ENN) are applied as the ensemble learners and CLSJaya is used to optimize the weight values [28]. Data
preprocessing automation at the data center is carried out in [29], where missing value imputation and
forecasting of replacement values for missing values are performed as part of the data preprocessing
algorithm. Automation of data preprocessing, feature extraction and hyperparameter tuning is developed
in [30].

Previous heart disease prediction algorithms need data preprocessing automation to enhance their
prediction capability. An approach that carries out data preprocessing in a data aware manner has not been
reported in the previous literature. Data preprocessing is implemented manually in the previous literature,
and even the automated data preprocessing algorithms that have been implemented are not data aware.

This paper discusses a disease prediction algorithm with a data aware data preprocessing
approach. The automated data preprocessing approach includes advanced feature importance detection, with
the extended isolation forest (EIF) used for anomaly detection. An algorithm that automatically detects
whether the data is balanced or imbalanced and applies the data preprocessing accordingly is developed in
this paper. The extended isolation forest based implementation is compared with the isolation forest based
ensemble learning algorithm. Section 2 discusses the proposed data aware ensemble learning methodology
with a detailed flowchart; Section 3 discusses the database utilized for the early detection of diabetes;
Section 4 presents the results and discussion for the proposed algorithm and the comparison results.


2. PROPOSED DATA AWARE ENSEMBLE LEARNING METHODOLOGY 

Generalization in individual machine learning methods suffers from limitations inherent to each
method. An algorithm that combines the advantages of different machine learning algorithms mitigates the
generalization issue to a great extent. The isolation forest algorithm is a good anomaly detection
algorithm, and a better anomaly detection implementation using the extended isolation forest algorithm is
incorporated here. The overall details of the proposed implementation are given in Figure 1. The disease
prediction implementation is divided into four major parts: i) data preprocessing automation, ii) outlier
detection and feature engineering, iii) training and testing, and iv) model evaluation. The complete data
preprocessing process shown in Figure 1 is automated, providing a machine learning paradigm that needs no
human intervention. The ensemble learning algorithm increases the generality of the learning algorithm.


[Figure 1 blocks: Data categorization finding (balanced and unbalanced); Data cleaning: 1. Missing value
imputation, 2. Exploratory data analysis, 3. Outlier detection (extended isolation forest), 4. Feature
engineering; Training and testing: 1. Sampling according to balanced/unbalanced, 2. Scaling, 3. Ensemble
with MLP, SVM, decision tree, 4. Logistic regression; ROC-AUC performance evaluation]

Figure 1. Overall block diagram-data preprocessing automated ensemble learning


2.1. Data preprocessing automation 

Automation of data preprocessing involves categorization of the data as balanced or imbalanced,
data cleaning and anomaly detection. The categorization decides the sampling method of the proposed
implementation. Missing value imputation, by replacing missing values with the mean value, and cleaning of
impurities are automated after the data is categorized.
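
The categorization and mean imputation steps can be illustrated with a minimal sketch using pandas. The function names, the 0.4 minority-share cut-off and the toy values are assumptions made for illustration; the paper does not state the criterion it uses to label data as balanced.

```python
import pandas as pd


def is_balanced(target: pd.Series, min_minority_share: float = 0.4) -> bool:
    """Return True when the rarest class holds at least `min_minority_share` of the rows.

    The 0.4 cut-off is an illustrative assumption, not a value from the paper.
    """
    shares = target.value_counts(normalize=True)
    return bool(shares.min() >= min_minority_share)


def impute_mean(frame: pd.DataFrame) -> pd.DataFrame:
    """Replace missing values in numeric columns with the column mean."""
    filled = frame.copy()
    numeric_cols = filled.select_dtypes(include="number").columns
    filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())
    return filled


# Tiny made-up example (not data from the paper).
toy = pd.DataFrame({
    "bp": [80.0, None, 70.0, 90.0],
    "age": [48.0, 53.0, None, 61.0],
    "label": ["ckd", "ckd", "notckd", "ckd"],
})
clean = impute_mean(toy)              # no missing numeric values remain
balanced = is_balanced(toy["label"])  # 3 vs 1 rows, so treated as imbalanced here
```

In a pipeline of this kind, the boolean returned by the categorization step would select the sampling method applied before training.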


2.2. Outlier detection and feature engineering 

Since disease prediction algorithms are among the most crucial medical applications, the reliability
of the early prediction algorithm is a primary concern of this research. Handling a large amount of data
(in the millions) needs a method with higher generality and highly orthogonal input data. The possibility
of high correlation among the sample data calls for better feature engineering techniques to achieve better
prediction performance. Achieving a data aware preprocessing algorithm is the challenge taken up as the
objective of this work. As shown in the block diagram, an extended isolation forest algorithm is
implemented for the chosen medical diagnosis problem. The outlier detection exhibits better orthogonality
when the extended isolation forest is applied to the medical diagnosis implementation.
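
A minimal sketch of outlier removal with an isolation forest is given here. It uses scikit-learn's standard `IsolationForest` as a stand-in, since the extended isolation forest is not part of scikit-learn; the contamination rate, the helper name `drop_outliers` and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def drop_outliers(features: np.ndarray, contamination: float = 0.05, seed: int = 0):
    """Fit an isolation forest and return (inlier rows, boolean inlier mask)."""
    forest = IsolationForest(
        n_estimators=100, contamination=contamination, random_state=seed
    )
    labels = forest.fit_predict(features)  # +1 marks inliers, -1 marks outliers
    mask = labels == 1
    return features[mask], mask


# Synthetic data (assumption): one Gaussian cluster plus two far-away points.
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(200, 3))
extremes = np.array([[12.0, 12.0, 12.0], [-11.0, 13.0, -12.0]])
data = np.vstack([cluster, extremes])

inliers, mask = drop_outliers(data, contamination=0.05)
```

The `contamination` parameter fixes the fraction of points flagged as outliers, so it would need to be chosen with care on real clinical data.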


2.3. Training and testing 

The complete dataset of training input and target pairs is split into training and testing sets in
an 80%/20% ratio. The results obtained from the individual machine learning methods and the proposed
ensemble learning method are compared for performance evaluation. The ensemble learning method uses the
multilayer perceptron (MLP), support-vector machine (SVM) and decision tree (DT) algorithms to obtain the
first level of output, and this output is given to logistic regression to obtain the final classification
output.
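
The two-level structure described above can be sketched with scikit-learn's `StackingClassifier`, which is used here as an assumed stand-in for the stacking implementation in the paper. The synthetic dataset, the hyperparameters and the scaling pipelines are illustrative choices, not values reported by the authors.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data (assumption); sizes are arbitrary.
X, y = make_classification(n_samples=224, n_features=9, n_informative=5, random_state=0)

# 80%/20% train/test split, stratified on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# First level: MLP, SVM and decision tree (MLP and SVM are scaled first).
base_learners = [
    ("mlp", make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    )),
    ("svc", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
    ("dt", DecisionTreeClassifier(max_depth=4, random_state=0)),
]

# Second level: logistic regression trained on cross-validated first-level outputs.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
test_accuracy = stack.score(X_test, y_test)
test_probabilities = stack.predict_proba(X_test)
```

Training the final estimator on cross-validated predictions, rather than on predictions for data the base learners have already seen, limits leakage from the first level into the second.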


3. DATASET AND DATA PROCESSING USED FOR PROPOSED ENSEMBLE LEARNING 

The chronic kidney disease dataset from the UCI repository [31] is used for the proposed
implementation. Since diabetes is an early symptom of kidney diseases, this dataset is obtained from the
repository. The inputs and output of the dataset are given in Table 1.


Table 1. Input and output data from UCI dataset [31] 


Variable                  Input/Target  Type       Unit or values
Blood Pressure            Input         Numerical  bp in mmHg
Age                       Input         Numerical  in years
Red Blood Cells           Input         Nominal    (normal, abnormal)
Specific Gravity          Input         Nominal    (1.005, 1.010, 1.015, 1.020, 1.025)
Pus Cell                  Input         Nominal    (present, not present)
Bacteria                  Input         Nominal    (present, not present)
Albumin                   Input         Nominal    (0, 1, 2, 3, 4, 5)
Hemoglobin                Input         Numerical  hemo in gms
Packed Cell Volume        Input         Numerical
Blood Glucose Random      Input         Numerical  bgr in mgs/dl
Hypertension              Input         Nominal    (yes, no)
White Blood Cell Count    Input         Numerical  wc in cells/cumm
Blood Urea                Input         Numerical  bu in mgs/dl
Appetite                  Input         Nominal    (good, poor)
Serum Creatinine          Input         Numerical  sc in mgs/dl
Diabetes Mellitus         Input         Nominal    (yes, no)
Sodium                    Input         Numerical  sod in mEq/L
Coronary Artery Disease   Input         Nominal    (yes, no)
Potassium                 Input         Numerical  pot in mEq/L
Anemia                    Input         Nominal    (yes, no)
Pedal Edema               Input         Nominal    (yes, no)
Red Blood Cell Count      Input         Numerical  rc in millions/cmm
Pus Cell Clumps           Input         Nominal    (present, notpresent)
Class                     Target        Nominal    (ckd, notckd)


4. RESULTS AND DISCUSSION 

The disease prediction algorithm carried out for cardiac disease is the ensemble learning
algorithm with the extended isolation forest algorithm. The dataset for the proposed implementation is
obtained from the UCI repository [31]. For the given dataset, effective data preprocessing automation is
obtained, along with more accurate early prediction than the individual learning algorithms. The extended
isolation forest algorithm combined with the ensemble learning algorithm is found to classify more
accurately with the data aware preprocessing algorithm.


4.1. Preprocessing results 

The Pandas toolbox is used to understand the data profile and statistical information about the
dataframe. For continuous variables, the profile report gives the skewness, minimum, maximum, standard
deviation and percentiles, which provide a statistical understanding of the data. The correlation of each
independent variable with the target variable is found. The reduced dataset chosen for training the
different algorithms for performance evaluation, after preprocessing of the input data, is summarized in
Table 2.
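
The summary statistics and target correlations mentioned above can be sketched with plain pandas calls, used here in place of a full profile report. The column names and values below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up values (assumption) with a binary target column.
toy = pd.DataFrame({
    "bp": [70, 80, 80, 90, 100, 110],
    "hemo": [15.4, 14.1, 12.6, 11.2, 9.8, 8.3],
    "target": [0, 0, 0, 1, 1, 1],
})

numeric = toy.drop(columns="target")

# count, mean, std, min, quartiles and max for each continuous variable.
summary = numeric.describe().T
summary["skew"] = numeric.skew()

# Pearson correlation of each input variable with the target.
target_corr = numeric.corrwith(toy["target"])
```

Variables with a correlation close to zero are natural candidates for removal when reducing the dataset before training.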


Table 2. Input data for training 


Parameter Values 
Number of variables 10 
Number of observations 224 
Total Missing (%) 0.0% 
Total size in memory 17.6 KiB 
Average record size in memory 80.6 B 


The variable types in the data include numerical and categorical data. Table 3 depicts the number
of variables of each type. The variables are converted to numerical variables, and the target is then
mapped to the input using the ensemble learning method. Data visualization is carried out as the first step
by the data aware algorithm. Outlier based analysis is carried out to give an idea about the data. The box
plot of each variable in the dataset is shown in Figure 2. It depicts the range of values of each variable.


Table 3. Variable types 

Variable Type Number of Variables 
Numeric 5 
Categorical 
Boolean 
Date 
Text (Unique) 
Rejected 
Unsupported 

Figure 2. Outlier detection for different input variables


4.2. Outlier results 

The outlier detection for the diabetes dataset using isolation forest is shown in Figure 3. The
normal points and the predicted outliers from the extended isolation forest are given in Figure 3(a). The
3D graph of the outliers in the variable space is shown in Figure 3(b), where the data can be clearly
separated into outliers and inliers. The outliers detected by the isolation forest algorithm are removed to
obtain better accuracy in the training procedure, and training on the cleaned data can improve the
classification performance, as shown in the results obtained. Isolation forests for both the male and the
female datasets are obtained for visualization.

The outlier detection for the female dataset using isolation forest is shown in Figure 4. Figure
4(a) shows the normal points and the predicted outliers for the female data from the dataset. The 3D graph
of the outliers for the female data in the variable space is shown in Figure 4(b). The outlier detection
for the male dataset using isolation forest is shown in Figure 5. Figure 5(a) shows the normal points and
the predicted outliers for the male data from the dataset. The 3D graph of the outliers for the male data
in the variable space is shown in Figure 5(b), where again the data can be clearly separated into outliers
and inliers. Anomaly detection with the isolation forest algorithm has clearly defined the inliers and
outliers in the data, and the outliers are removed.


4.3. AUC-ROC results 

To evaluate the performance of the proposed algorithm, four classifiers are used, namely logistic
regression, the MLP classifier, the decision tree classifier and the support vector classifier (SVC), and
they are ensembled using a stacking CV classifier. Imbalanced data is handled using SMOTE, whose use is
decided by the data categorization. The data used here is balanced. The classification algorithm relies on
the area under the receiver operating characteristic curve (AUC-ROC) for its performance evaluation. ROC
is a probability curve, and the AUC value measures the degree of separability between the classes: the
higher the AUC, the better the model distinguishes between the classes.
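
A minimal sketch of the balance-dependent sampling step followed by the ROC-AUC computation is given here. The helper `maybe_oversample`, its 0.4 cut-off, the synthetic data and the single logistic regression model are assumptions for illustration; `SMOTE` comes from the imbalanced-learn package and is imported only on the imbalanced path.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split


def maybe_oversample(X, y, min_minority_share=0.4, seed=0):
    """Apply SMOTE only when the minority class share falls below the cut-off."""
    shares = np.bincount(y) / len(y)
    if shares.min() >= min_minority_share:
        return X, y  # treated as balanced: no resampling
    # imbalanced-learn is imported lazily because it is only needed on this path.
    from imblearn.over_sampling import SMOTE
    return SMOTE(random_state=seed).fit_resample(X, y)


# Synthetic, roughly balanced placeholder data (assumption).
X, y = make_classification(n_samples=224, n_features=9, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_bal, y_bal = maybe_oversample(X_train, y_train)

model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class

fpr, tpr, _ = roc_curve(y_test, scores)     # points of the ROC curve
roc_auc = roc_auc_score(y_test, scores)     # area under that curve
```

Resampling is applied to the training split only, so that the test split keeps its original class distribution when the AUC is measured.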

Figure 3. Outlier from isolation forest (a) isolation forest outliers 2D and (b) isolation forest outliers 3D

Figure 4. Outlier detected on female data (a) isolation forest outliers 2D and (b) isolation forest outliers 3D

Figure 5. Outlier detected on male data (a) isolation forest outliers 2D and (b) isolation forest outliers 3D


Figure 6 shows the AUC-ROC curves for logistic regression (LR), MLP, decision tree, SVC and the
stack (ensemble) method. It is observed from Figure 6 that the ensemble reaches an AUC of 0.922, higher
than the other machine learning algorithms. Among the individual machine learning algorithms, MLP scored
the best AUC, 0.918.

Figure 6. AUC ROC curves for LR, MLP, decision tree, SVC and stack (ensemble) method


5. CONCLUSION 

The results obtained from the different outlier detection methods and the sampling methods are
discussed and compared, and the performance of the proposed methods is evaluated. A suggestion mechanism
for the best ensemble classification for medical diagnosis is implemented. The multilayer perceptron
classifier reached an AUC of 0.918. The ensemble learning algorithm, which includes a multilayer
perceptron classifier, a logistic regression classifier, a support vector classifier and a decision tree
algorithm together with the isolation forest based anomaly detection algorithm, performed better than the
individual machine learning algorithms, with an AUC of 0.922. For the given dataset, effective data
preprocessing automation is obtained using the ensemble approach, along with more accurate early
prediction than the individual learning algorithms.